Abstract

This paper documents progress to date on a research
project, the goal of which is wartime-event prediction. It
describes the operational concept, the data-mining
environment, and data-mining techniques that use Bayesian
networks for classification. Key steps in the research plan
are as follows: 1) implement machine learning; 2) test the
trained networks; and 3) use the technique to support a
battlefield commander by predicting enemy attacks. Data
for training and testing the technique can be extracted
from the object-oriented database that supports the
Integrated Marine Multi-Agent Command and Control System
(IMMACCS). These data were derived from message traffic
generated during U.S. Marine Corps exercises. The
class structure in the IMMACCS data model is especially
well suited to support attack classification.

Keywords  -  Bayesian  networks,  classifiers,  command 
and  control,  data-mining  algorithms,  military  exercise, 
object-oriented  database,  readiness,  U.S.  Marine  Corps 

1.  Introduction 

The ability to predict attacks and other hostile events
during times of conflict is very important to military
commanders from the standpoint of readiness. The
earlier the notice and the more widespread the notification,
the better able all echelons are to respond to threats
efficiently and with the correct combination of forces.

The literature is replete with recent research results on
data mining and data classification. (See, for example, [3,
4, 5 and 10].) Data mining, data classification and data
correlation are related to data fusion. As these techniques
mature, better tools become available to model and to
correlate data from complex operational scenarios. The
purpose of this research is to create and extend a method to
predict attacks on the U.S. Marine Corps using an
object-oriented command and control database and data-mining
techniques [9].

The paper is organized as follows. Section 2 describes
data mining and its importance in a military context.
Section 3 consists of a concept of operations. Section 4
explains the general approach to the problem of predicting
attacks during wartime and other periods of tension or
conflict. Section 5 describes Bayesian networks as well as
examples of software that generates and uses them. Section
6 discusses plans to implement specific versions of this
software in the research environment. Section 7 describes
the use of object-oriented data for training and testing, as
well as research issues associated with data-set selection.
Section 8 summarizes the paper and briefly discusses future
directions. The paper concludes with Appendix A, a
bibliography of data-mining and related literature listed by
category.

2.  Data  mining 

Data mining is the search for and extraction of hidden
and useful patterns, structures and trends in large,
multidimensional, and heterogeneous data sets that were
collected originally for another purpose. (See, for example,
[10].) Data mining is an art that is supported by a
considerable body of science, engineering and technology. For
example, data mining uses techniques from such diverse
areas as data management, statistics, artificial intelligence,
machine learning, pattern recognition, data visualization,
and parallel and distributed computing. Data mining is
possible today because of advances in these many fields;
however, this multidisciplinary characteristic also makes
data mining a difficult subject to teach and learn. Whereas
the Structured Query Language (SQL) is inadequate to
answer many complex queries, data mining can support
searches for patterns in temporal and spatial databases in a
more efficient manner. Data mining is important to the
military because commanders and the analysts who support
them cannot anticipate all future uses of information at the
time of data collection.

2.1  Limitations  of  data  mining 

Whereas the goal of data mining is to identify hidden
patterns, the search algorithms chosen for the particular
task may miss an important and interesting pattern or even
a class of similar patterns. A systematic method to
preclude this problem is not available.

Similarly, there is no guarantee that any given data
mining effort will yield something new and useful,
regardless of how many well-designed data mining tools are
used. This is because the data may not contain the desired
patterns. Data mining is a search through observational data
for the relationships among them, rather than a
measurement of experimental data.

3. Concept of operations

The concept of operations for a future system based on
this research is 1) to use data-mining and data-classification
algorithms to detect patterns associated with attacks (e.g.,
to identify factors that indicate an imminent attack)
and 2) to correlate these patterns with current
events with a view toward supplying military commanders
with a prediction of the next attack and a confidence level
that pertains to that prediction. A considerable amount of
data associated with events that have preceded known
attacks is required to model attacks, to search for common
features, and to find these patterns in new data.

Success in this effort depends on a characterization of
the circumstances that translate to well-defined observables
that preceded past attacks. The more detailed the available
knowledge, the better the resulting model, and the greater
the probability that data instantiating critical variables can
be collected. We expect that such detailed data for all
variables will not be available prior to future attacks and that
not all available data will be useful in predicting attacks
(i.e., some will function as “noise” in the analysis). Thus, the
task involves identification of algorithms that can detect
pre-attack features in clutter and the use of pattern
recognition. Modern methods of statistical pattern recognition are
sufficiently computationally oriented to use a
higher-dimensional space and are less sensitive to noise than older
methods. Success in attack prediction will depend, at least
in part, on how well these methods can be implemented
with the available data.


4.  General  approach 

Hostile events can be characterized with respect to as
many relevant variables as are deemed necessary and
available to predict future attacks. An object-oriented
message-traffic database can be analyzed for the occurrence of telltale
signs of pending attacks. Our objective is to generate an
event prediction (in terms of a probability) with a
confidence value associated with it. Therefore, it is necessary to
determine the combinations of events and observations that
will have a higher probability of indicating a future attack.
A baseline can be modeled from normal operational
scenarios and from military events during times of conflict that
do not constitute attacks per se.

The attack alarm-generation process and the reduction
of false positives can be approached using constraints from
models of known attacks. The identification of the
appropriate features (and groups of features) that can flag
imminent attacks is the most challenging part of the process.
One approach is to explore the generation of a knowledge
base encoded in Bayesian networks.

A literature search was conducted for publications on
various subjects that relate to data mining, including
algorithms and their applications. Appendix A presents the
result of this literature search. Data-mining algorithms can
be used to identify complex patterns in the data that
correlate well to hostile events. Criteria can be developed for
sufficient correlation and confidence levels in data
associations. For example, one metric that could be used is
correlation strength, which is the ratio of the joint probability
to the individual probability of observing a pattern [3].
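
The correlation-strength metric can be illustrated with a short sketch. The exact definition in [3] is not reproduced in this paper, so the code assumes one common reading: the joint probability of a pattern and an event divided by the product of their individual probabilities. The record format (a list of sets of observed item names) and the item names in the usage note are hypothetical.

```python
def correlation_strength(records, pattern, event):
    """Estimate P(pattern, event) / (P(pattern) * P(event)) from counts.

    `records` is a list of sets of observed item names (assumed format).
    Values above 1 suggest that the pattern and the event co-occur more
    often than they would if they were independent.
    """
    n = len(records)
    n_pattern = sum(1 for r in records if pattern in r)
    n_event = sum(1 for r in records if event in r)
    n_both = sum(1 for r in records if pattern in r and event in r)
    if n == 0 or n_pattern == 0 or n_event == 0:
        return 0.0  # the ratio is undefined without observations
    return (n_both / n) / ((n_pattern / n) * (n_event / n))
```

For example, `correlation_strength(records, "TROOPS_SIGHTED", "GROUND_ATTACK")` would compare how often the two items appear together with how often each appears on its own.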

5.  Bayesian  networks 

Bayesian networks can be used to classify data into
categories. Bayesian networks are:

• probabilistic networks,

• directed acyclic graphs that encode certain
dependencies between nodes that represent random
variables,

• knowledge bases with knowledge in the network’s
structure and in its conditional probability table, and

• structures that can be used to infer causality.

5.1  Naive  Bayesian  networks 

A naive Bayesian network is a very simple structure in
which all random variables representing observable data
have a single, common parent node - the class variable.
The naive Bayesian classifier has been used extensively for
classification because of its simplicity, and because it
embodies the strong independence assumption that, given the
value of the class, the attributes are independent of each
other.
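
The independence assumption means that the posterior for a class is proportional to the class prior multiplied by one conditional probability per attribute. The following is a minimal sketch of such a classifier for discrete attributes. It is an illustration only, not the SRI software; the function names and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            values_seen[i].add(v)
    return class_counts, value_counts, values_seen

def classify(model, row):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, v in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            k = len(values_seen[i])
            score += math.log((value_counts[(i, c)][v] + 1) / (n_c + k))
        if score > best_score:
            best, best_score = c, score
    return best
```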


Naive Bayesian networks work remarkably well
considering that this independence assumption may not be valid
from a logical standpoint. The performance of a naive
Bayesian network can be improved with the addition of trees
that provide augmenting edges to a naive Bayesian
network by representing correlations between the attributes.

5.2  Tree  Augmented  Naive  (TAN)  Bayesian 
Classification  Algorithm 

The Space and Naval Warfare Systems Center, San
Diego, has access to SRI International’s classifier algorithms
that were developed under the Defense Advanced Research
Projects Agency’s High Performance Knowledge Base
Program. For example, SRI’s Tree Augmented Naive
(TAN) Bayesian Classification Algorithm is a classifier
algorithm based on Bayesian networks with the advantages
of robustness and polynomial computational complexity
[4 and 5].

Bayesian networks have some drawbacks that SRI has
addressed in the TAN algorithm. In ordinary naive
Bayesian networks, the variables (data) are assumed to be
conditionally independent given the class. Logically, this is not
always true. For example, suppose enemy troops are
observed at location X and enemy tanks are observed at
location Y. When using naive Bayesian networks, one
assumes that these events are independent given the class. However, both
events may be part of the overall enemy battle plan. In the
TAN algorithm, the trees provide edges that represent
correlation between the variables.
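
A common way to choose augmenting edges in the TAN literature is to weight each pair of attributes by their conditional mutual information given the class and then keep a maximum-weight spanning tree over the attributes. Whether TAN 2.1 follows exactly this procedure is not stated here, so the sketch below is an assumption. It shows only the edge-weight estimate for discrete data, and the function name is hypothetical.

```python
import math
from collections import Counter

def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) in nats from three parallel lists of discrete values."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), count in n_xyc.items():
        # P(x, y, c) * log[ P(x, y | c) / (P(x | c) * P(y | c)) ], using counts.
        total += (count / n) * math.log(
            (count * n_c[c]) / (n_xc[(x, c)] * n_yc[(y, c)])
        )
    return total
```

In the troops-and-tanks example, a large value for the two sighting variables would indicate that an edge between them is worth adding.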

Bayesian  networks,  especially  with  tree  augmentation, 
are  a  suitable  technology  for  data-mining  classification  and 
event  prediction  for  the  following  reasons: 

• First, one need not provide all joint probability
values to specify a probability distribution for collections of
independent variables [2].

• Second, one could mix modeling (e.g., explicit
knowledge engineering for knowledge elicited from experts)
with statistical data induction and adaptivity. This mix
would require fewer data values to induce better quality
models.

• Third, one could use these models to compute the
value of information. For example, having seen signs "A"
and "B" of an imminent attack, what is the best
information to collect next to confirm that hypothesis?

• Fourth, one could characterize explicitly the kinds of
attacks. For example, given an attack of type “Air attack,”
what are the most likely signals? These signals could be
collected regularly to fill the database used as input into
the TAN algorithm.

The TAN algorithm makes some tradeoffs between
accuracy and computation. It approximates a probability
distribution using some constraints on the complexity of the
representation; however, it is extremely fast (low
polynomial), efficient (one pass over the data), and robust (low
order statistics).

The TAN algorithm accepts data sets as input and
induces Bayesian networks as output. Specifically, the TAN
algorithm is intended to be used as a classification
algorithm, which means that the input would be a file with
tuples of the form $(x_1, x_2, x_3, \ldots, x_n, c)$ where the $x_i$ are
values that the variables $X_i$ take and $c$ is the value that the class
variable $C$ takes. To set the range of each variable, the
TAN algorithm needs an auxiliary file that contains a
description of each variable, including the range of values
representing the degree of intensity.

The TAN algorithm’s output is a Bayesian network
encoding of $P(C, X_n, \ldots, X_1)$ in an efficient manner. To use
TAN as a classifier, one simply computes $P(C \mid x'_n, \ldots, x'_1)$.
Given a new vector $x'_n, \ldots, x'_1$ and the resulting probability
distribution over $C$, one can select the class value with the highest
probability as the classification. To compute the
confidence in this value, the bootstrap method can be used [6].
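
The selection step and one possible bootstrap check can be sketched as follows. The bootstrap can be applied in several ways, and the variant in [6] is not reproduced in this paper. The version below is an assumption: it retrains on resampled data and reports how often the resampled models agree with the original prediction. The `fit` and `predict` arguments are placeholders for any classifier.

```python
import random

def select_class(posterior):
    """Pick the class value with the highest posterior probability."""
    return max(posterior, key=posterior.get)

def bootstrap_agreement(data, fit, predict, instance, n_resamples=200, seed=0):
    """Fraction of bootstrap-trained models that agree with the full-data prediction.

    `fit(data)` returns a model; `predict(model, instance)` returns a dict
    mapping class value -> probability. Both are hypothetical callables.
    """
    rng = random.Random(seed)
    reference = select_class(predict(fit(data), instance))
    agree = 0
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        if select_class(predict(fit(resample), instance)) == reference:
            agree += 1
    return reference, agree / n_resamples
```

An agreement fraction close to 1 suggests that the prediction is stable under resampling of the training data; a fraction near the chance level suggests that it is not.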

The TAN algorithm outperforms naive Bayesian
networks while retaining their robustness and computational
simplicity (polynomial rather than exponential complexity).

The TAN algorithm captures the best of both discrete
and continuous attributes. Therefore, the TAN algorithm
achieves classification performance that is at least as good
as, and in some cases better than, models that use purely
discrete or purely continuous variables. Studies at SRI
have demonstrated that the TAN algorithm performs
competitively with other state-of-the-art methods.

TAN, and similar algorithms, can be made to perform
the classification of certain battlefield situations for the
Marine Corps. Much work needs to be done in this area,
particularly with regard to data-set selection, data cleansing
and the refinement of the algorithm to meet specific needs.

In addition to the TAN algorithm, SRI has more
general algorithms for inducing Bayesian networks that do not
make the compromises that the TAN algorithm does.
These algorithms try to fit the best distribution possible
with no constraints. The disadvantage is that the
computation of these models is slower; however, this may be
acceptable and desirable in some cases. Algorithms can be
implemented with the same data and the results compared.

5.3  GaussMeasurePredict  Program 

The GaussMeasurePredict program was developed by
Nir Friedman to measure the performance of an induced
TAN model. (See, for example, [4].) The input of
GaussMeasurePredict consists of the following items: 1) an
induced Naive Bayesian network from TAN, 2) the name of
the variable to predict, and 3) a test data set that contains
instance information. When testing the Bayesian network
model, the variable to predict is specified and its true value
is known. Usually this will be the outcome of the class
variable.


GaussMeasurePredict also has the option to calculate
and display the probability of each class value for each
instance in the input file. This feature is particularly useful
for Receiver Operating Characteristic (ROC) curves as well
as for determining other statistics [7]. Thus, with this
option, GaussMeasurePredict can output the probability
distribution for each instance in addition to a summary.
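
Per-instance class probabilities are what an ROC curve needs. As a hedged illustration, the sketch below builds ROC points and the area under the curve from a list of predicted probabilities and true labels. It does not read GaussMeasurePredict output; the input lists and function names are assumptions.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    `scores` are predicted probabilities of the positive class and `labels`
    are True for actual positives. Assumes both classes are present.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under ROC points ordered by increasing false positive rate."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```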

The output of GaussMeasurePredict is a prediction of
the accuracy of the network in the TAN Bayesian network
.bn file. It can be used to predict the accuracy of other
classifier algorithms as long as the output file matches the
format of TAN’s Bayesian network file.

GaussMeasurePredict is intended to be used to measure
the accuracy of predictions and not to generate predictions
for unlabeled instances. Unlike the TAN algorithm,
GaussMeasurePredict does not accept instances with “?”
for missing values in an instance input file. All variables
must have filled values in each instance. However, because
GaussMeasurePredict compares the induced Bayesian
network to the test data set, it also can be used to infer the
class of an unknown instance by filling in the class
(Outcome) variable with a guessed value. Using the option
described above, GaussMeasurePredict can output a
predicted class probability for each class value. The class with
the highest probability is the predicted class for that
instance.

Fortunately,  in  the  simplest  case  of  attack  predictions, 
only  two  values  are  possible  for  the  class  variable: 
ATTACKLIKELY  and  ATTACKNOTLIKELY.  In 
more  detailed  cases  of  attack  predictions  in  which  specific 
attack  types  are  listed  in  the  data-definition  input  file,  the 
class  variables  may  assume  2N  values  where  N  is  the  total 
number  of  attack  types  considered  in  the  class.  (The  2N 
arises  from  including  the  negation  of  the  likelihood  of  an 
attack  of  each  type.) 
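
The count of 2N can be illustrated by enumerating one "likely" and one "not likely" value per attack type. The naming scheme below is hypothetical and is not taken from the data-definition file.

```python
def class_values(attack_types):
    """Return one 'likely' and one 'not likely' class value per attack type."""
    values = []
    for attack_type in attack_types:
        values.append(attack_type + "_LIKELY")
        values.append(attack_type + "_NOT_LIKELY")
    return values

# Two attack types (N = 2) give 2N = 4 class values.
print(class_values(["AIR_ATTACK", "GROUND_ATTACK"]))
```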

6.  Software  implementation  and  plans 

Data-mining software was tested for correct operation
with clean data sets designed specifically for testing. The
programs described below are included in the research
environment. The software includes the TAN algorithm and
the GaussMeasurePredict program, which uses the output of the TAN
algorithm. Inputs to GaussMeasurePredict must be
complete. Plans include the acquisition of additional
algorithms that are designed to operate on incomplete data sets.

6.1  TAN  2.1  availability 

The TAN version 2.1 software and user’s manual are
available for download via file-transfer protocol (ftp) from
SRI’s web site: http://edi.erg.sri.com/tan/rANintro.htm.
The user is required to register with a name and password.
To obtain the TAN algorithm, Netscape is recommended
and may be required. The Solaris CDE web browser,
HotJava, is not recommended for downloading TAN. The TAN
user manual is included with the software. (See, for
example, [7].)

The  TAN  software  was  downloaded  from  SRI’s  web 
site  onto  a  Solaris  SPARC  Station  20  computer  running 
the  Solaris  2.7  UNIX  operating  system  and  using  the 
Common  Desktop  Environment  (CDE). 

TAN 2.1 constitutes the main data-mining tool in the
research environment of this project. TAN can be used as a
base classifier and also as a method to fuse the output of
other data-mining and classification algorithms. When
algorithms have been tested and programmed, data
visualization tools can be identified, tested and used to view the
data and to continue the pattern-recognition process.

6.2  GaussMeasurePredict  Availability 

The GaussMeasurePredict program is available along
with the TAN software from SRI’s web site. It is included
with the TAN package and can be executed once the files are
“unzipped” and the appropriate input files are
available.

7.  Object-oriented  data  implementation 

The object model, on which the Integrated Marine
Multi-Agent Command and Control System (IMMACCS)
Database is based, is a detailed representation of the
battlespace with objects derived from the March, 1998 Urban
Warrior Advanced Warfighting Exercise [1 and 8]. Object
attributes and their associations, as well as class
inheritance, also are described in [8]. The IMMACCS Database
uses the Unified Modeling Language symbolic
representation method [8].

The IMMACCS Database includes in its structure the
following topics of interest to the Marine Corps: aircraft;
ground vehicles; sea-surface vehicles; weapons and weapon
systems; electronic devices of many kinds; terrain; bodies
of water; logistics information; transportation infrastructure;
various specialized units; personnel data; and, most
importantly for this application, military events. Class
inheritance paths and allowed values are specified [8]. The use of
an object-oriented database and the representation of
military entities in object form provide a degree of
interoperability and extensibility that allows multiple services to
use and add to this common tactical picture [1].

The data sets for this data-mining effort will come from
IMMACCS. The class structure in the IMMACCS data
model is especially well designed for adaptation to the
attack/non-attack classification task. When data fill
becomes available, especially for the attributes and object
classes of interest, the IMMACCS database will be a very
desirable data source for reasons described in the next
subsection.


7.1  Construction  of  training  data  sets 

The following discussion illustrates the strategy for
constructing training data sets using certain IMMACCS
object-oriented data classes as examples. The data-mining
classification task is to identify the value of the
Bayesian-network class variable of an unknown data set. Initially,
two values of the Bayesian-network class variable will be considered:
“imminent attack likely” and “imminent attack not likely.”
To train the TAN algorithm, the value of the
Bayesian-network class variable will be identified in the training data
sets for both classes.

Various types of attacks and defenses are listed as
allowed values (among others) in the MILITARY_EVENT
object class in the IMMACCS Database. These are
AIR_ATTACK, GROUND_ATTACK, AIR_DEFENSE,
GROUND_DEFENSE, and SMALL_SCALE_ATTACK.
Only instances that correspond to attacks from hostile
forces on the Marine Corps will be considered. Any attack
launched by the Marine Corps on hostile forces will not be
counted in the “attack” category. In contrast, defenses by
the Marine Corps against hostile attacks, whether the
attacks are launched from the air or the ground, are likely
to play a role in the overall model when they influence
subsequent enemy attacks. For example, enemy commanders
may select a battle plan that does not involve an air attack
on an area with a strong Marine Corps air defense.

Several naive Bayesian networks can be induced, one
for each attack type and one for the combined data for all
attack types. For the combined attacks, the class variable
can take multiple values, corresponding to the likelihood
of a particular attack type, and the likelihood that this
attack type will NOT occur. Initially, all attack types will be
assumed to be independent, although this is rarely true in
actual battles. For example, ground attacks are more likely
to follow air attacks at the same location than vice versa.

For the non-attack training instances, data associated
with the other values of the MILITARY_EVENT object
class will be used, such as WITHDRAWL_EVENT,
DELAYING_ACTION, AIR_REINFORCEMENT or
DRILL_EVENT. Other non-attack training instances also
can be derived, for example, from the AIR_DEFENSE and
GROUND_DEFENSE values, provided the instances
pertain to events associated with enemy air defenses and
ground defenses.

The date-time groups (DTGs) associated with each
instance, both of attack and non-attack situations, will be
noted and other data objects with the same DTGs (and
with DTGs just prior to the event) will be included in the
training data sets. The training data also could include
objects present in the same vicinity as the attack or
non-attack event that do not have DTGs. This will provide as
comprehensive a description of the battlespace at the time
and place of the attack as is possible, given the level of
data granularity. This method of formulating training data
sets can be extended by including in each data set the data
that pertain to DTGs several days prior to the event to
ascertain whether this will yield better results. The exact
time span that each data set should cover is an open
research issue.
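
The DTG-based selection described above can be sketched as a time-window filter. The dictionary layout (`dtg`, `label`) and the function name are hypothetical and are not part of the IMMACCS object model. Objects without DTGs are simply skipped here, although they could be added by a separate location test.

```python
from datetime import datetime, timedelta

def build_training_instance(event, observations, window=timedelta(days=1)):
    """Collect observations dated within `window` before the event, inclusive.

    `event` and each observation are dicts with a "dtg" datetime (assumed
    layout); observations whose "dtg" is missing or None are left out.
    """
    start = event["dtg"] - window
    selected = [o for o in observations
                if o.get("dtg") is not None and start <= o["dtg"] <= event["dtg"]]
    return {"label": event["label"], "observations": selected}
```

Widening `window` to several days corresponds to the extension discussed above, so the effect of the time span can be compared across otherwise identical training runs.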

7.2  Design  considerations  in  the  construction 
of  test  data  sets 

Changes can be made in the test data sets, depending
on the desired outcome of the test. For example, to
determine how far in advance an attack can be predicted, the
instances that pertain to an entire day immediately prior to
the attack can be omitted systematically from test data sets.
If the algorithm still makes the correct prediction, one can
conclude, at least as far as that test data set is concerned,
that an attack can be predicted 24 hours in advance.
Similarly, if two days’ worth of data immediately preceding the
attack can be omitted without a significant decline in the
prediction accuracy, this is an indication that attacks can be
predicted 48 hours in advance.
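
This omission test can be sketched as a loop that removes one more day of data at a time and checks whether the prediction still holds. The instance layout and the `predicts_attack` callable are placeholders for the real test data and the trained classifier.

```python
from datetime import datetime, timedelta

def omit_days_before_attack(instances, attack_time, days_omitted):
    """Keep only instances dated at least `days_omitted` days before the attack."""
    cutoff = attack_time - timedelta(days=days_omitted)
    return [inst for inst in instances if inst["dtg"] <= cutoff]

def prediction_horizon(instances, attack_time, predicts_attack, max_days=7):
    """Largest number of omitted days for which the attack is still predicted.

    `predicts_attack(instances)` is a hypothetical callable that runs the
    trained classifier on the reduced test set and returns True or False.
    """
    horizon = 0
    for days in range(1, max_days + 1):
        reduced = omit_days_before_attack(instances, attack_time, days)
        if not predicts_attack(reduced):
            break
        horizon = days
    return horizon
```

A returned value of 2 would correspond to the 48-hour case described above, for that test data set only.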

We expect, however, that omitting more and more data
that pertain to the days just prior to an attack will cause
the attack-prediction accuracy to degrade. The exact
functional form of this degradation (linear, exponential,
logarithmic, etc.) is another open research question. This type of
testing can enable researchers to determine the number of
days to include in the data collection and the specific data
elements to be collected that are necessary to formulate as
accurate a prediction as possible.

Test and training data sets will be formulated according
to an n-fold cross-validation procedure. For example, to
implement the first cycle of a 5-fold cross validation with a
data set consisting of 1,000 records, the first 800 records
can be selected for training with the last 200 records being
reserved for testing. During the second phase of training
and testing, the first 600 records and the last 200 records
together will comprise the training data set and the remaining
records will be used for testing. In the third phase, the first
and last 400 records will be used for training and the
middle 200 for testing, etc. The advantage of this procedure is
that it can be used to identify anomalies in the testing and
training so that if the results are comparable for all five
tests, a higher level of confidence in the method is
obtained.
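
The fold boundaries in this example can be generated mechanically. The sketch below is a minimal illustration with a hypothetical function name. It yields contiguous test blocks starting from the end of the data set, following the order of the 1,000-record example. Letting the final fold absorb any remainder is a choice made for this sketch.

```python
def cross_validation_folds(n_records, n_folds):
    """Yield (train_indices, test_indices) with one contiguous test block per fold.

    Test blocks run from the last block of records to the first.
    """
    fold_size = n_records // n_folds
    for k in range(n_folds):
        stop = n_records - k * fold_size
        # The final fold absorbs any remainder at the front of the data set.
        start = 0 if k == n_folds - 1 else stop - fold_size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_records))
        yield train, test
```

With `cross_validation_folds(1000, 5)`, the first fold tests on records 800 to 999 and the second on records 600 to 799, with all other records used for training in each case.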

8.  Conclusion 

This paper describes a data-mining environment
designed to support wartime-event prediction using Bayesian
networks to perform a data-classification task. The TAN
algorithm was selected to induce a network using data
extracted from an object-oriented database that contains
information from exercise message traffic. Future work could
include a user-friendly interface designed on top of the
algorithms to provide automated input of selected data sets to
the algorithm of choice. Success in this research project
will pave the way for a more precise
indication-and-warning system for the U.S. Marine Corps.